Distributed Text Retrieval From Overlapping Collections

Authors

  • Milad Shokouhi
  • Justin Zobel
  • Yaniv Bernstein
Abstract

In standard text retrieval systems, the documents are gathered and indexed on a single server. In distributed information retrieval (DIR), the documents are held in multiple collections; answers to queries are produced by selecting the collections to query and then merging results from these collections. However, in most prior research in the area, collections are assumed to be disjoint. In this paper, we investigate the effectiveness of different combinations of server selection and result merging algorithms in the presence of duplicates. We also test our hash-based method for efficiently detecting duplicates and near-duplicates in the lists of documents returned by collections. Our results, based on two different designs of test data, indicate that some DIR methods are more likely to return duplicate documents, and show that removing such redundant documents can have a significant impact on the final search effectiveness.
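The merge-then-deduplicate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: it assumes each collection returns (doc_id, score, text) tuples, that scores are already comparable across collections, and that a SHA-1 hash of whitespace- and case-normalized text is an adequate fingerprint. It therefore catches only exact duplicates after normalization, not near-duplicates. The names `fingerprint` and `merge_and_dedupe` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical result shape: (doc_id, score, text)
Result = Tuple[str, float, str]


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def merge_and_dedupe(
    result_lists: Dict[str, List[Result]],
) -> List[Tuple[str, str, float]]:
    """Merge per-collection results by score and drop repeated documents.

    Assumption: scores from different collections are directly comparable.
    Returns (collection, doc_id, score) tuples, best score first.
    """
    merged = []
    for collection, results in result_lists.items():
        for doc_id, score, text in results:
            merged.append((score, collection, doc_id, fingerprint(text)))

    # Highest score first; the sort is stable, so ties keep input order.
    merged.sort(key=lambda item: item[0], reverse=True)

    seen = set()
    final = []
    for score, collection, doc_id, fp in merged:
        if fp in seen:
            continue  # same content already returned by another collection
        seen.add(fp)
        final.append((collection, doc_id, score))
    return final


if __name__ == "__main__":
    lists = {
        "A": [("a1", 0.9, "Hello  World"), ("a2", 0.5, "foo")],
        "B": [("b1", 0.8, "hello world"), ("b2", 0.7, "bar")],
    }
    for row in merge_and_dedupe(lists):
        print(row)
```

Keeping the highest-scoring copy of each document is one possible policy; a real system might instead prefer the copy from the most trusted collection.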

Related resources

Application-Embedded Retrieval from Distributed Free-Text Collections

A framework is presented for application-embedded information retrieval from distributed free-text collections. An application’s usage is sampled by an embedded information retrieval system. Samples are converted into queries to distributed collections. Retrieval is adjusted through sample size and structure, anydata indexing, and dual space feedback. The framework is investigated with a retriev...

Information retrieval in digital libraries: bringing search to the net.

A digital library enables users to interact effectively with information distributed across a network. These network information systems support search and display of items from organized collections. In the historical evolution of digital libraries, the mechanisms for retrieval of scientific literature have been particularly important. Grand visions in 1960 led first to the development of text...

Methodologies for Distributed Information Retrieval

Text collections have traditionally been located at a single site and managed as a monolithic whole. However, it is now common for a collection to be spread over several hosts and for these hosts to be geographically separated. In this paper we examine several alternative approaches to distributed text retrieval. We report on our experience with a full implementation of these methods, and give ...

A Distributed Digital Library Architecture Incorporating Different Index Styles

The New Zealand Digital Library offers several collections of information over the World Wide Web. Although full-text indexing is the primary access mechanism, musical collections can also be accessed through a novel melody retrieval system. In offering this service over a three-year period, we have had to face many practical challenges in building, maintaining, and administering diverse collec...

Utilizing Context in Ranking Results from Distributed CBIR

Selection and ranking of relevant images from image collections remains a problem in content-based image retrieval. This problem becomes even more visible and acute when attempting to merge and rank multiple result sets retrieved from a distributed database environment. This paper presents findings from a project that investigated if combining text and image retrieval algorithms with the use of...

Publication date: 2007